Study of Structure-Active Relationship for Inhibitors of HIV-1 Integrase LEDGF/p75 Interaction by Machine Learning Methods
Li, Y.; Wu, Y.B.; Yan, A.X.*
Molecular Informatics, 2017, 36(7), 1600127.
HIV-1 integrase (IN) is a promising target for anti-AIDS therapy, and LEDGF/p75 has been shown to enhance the strand transfer activity of HIV-1 integrase in vitro. Blocking the interaction between IN and LEDGF/p75 is therefore an effective way to inhibit HIV replication and infection. In this work, 274 LEDGF/p75-IN inhibitors were collected as the dataset. Support Vector Machine (SVM), Decision Tree (DT), Function Tree (FT) and Random Forest (RF) algorithms were applied to build several computational models that predict whether a compound is a highly active or a weakly active LEDGF/p75-IN inhibitor. Each compound was represented by MACCS fingerprints and CORINA Symphony descriptors. The prediction accuracy on the test set is above 70 % for all models. The best model, Model 3B, built with FT, achieved a prediction accuracy of 81.08 % and a Matthews Correlation Coefficient (MCC) of 0.62 on the test set. In addition, we found that hydrogen bonding and hydrophobic interactions play an important role in the bioactivity of the inhibitors.
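The accuracy and MCC quoted in the abstract are both derived from a binary confusion matrix. The following is a minimal sketch of those calculations, together with sensitivity (SE) and specificity (SP). The function name `classification_metrics` and the counts passed to it are hypothetical: the actual confusion matrix is not given in the text, and the counts were chosen only because they yield values that round to those reported for Model 3B.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute accuracy, SE, SP and MCC from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # fraction of highly active compounds recovered
    specificity = tn / (tn + fp)  # fraction of weakly active compounds recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts (an assumption, not data from the paper).
metrics = classification_metrics(tp=32, fn=7, tn=28, fp=7)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it is less easily inflated when one class dominates the test set.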
Performance of the classification models on the dataset of 274 LEDGF/p75-IN inhibitors
Model name | Algorithm | Descriptors | Training set accuracy (%) | Training set 5-fold cross-validation accuracy (%) | Training set 10-fold cross-validation accuracy (%) | Training set leave-one-out (LOO) cross-validation accuracy (%) | Test set sensitivity (SE) | Test set specificity (SP) | Test set accuracy (%) | Test set MCC | Test set AUC |
---|---|---|---|---|---|---|---|---|---|---|---|
Model 1A | SVM | MACCS | 74.00 | 64.00 | 65.00 | 66.50 | 0.67 | 0.82 | 71.62 | 0.45 | 0.706 |
Model 2A | DT | MACCS | 76.00 | 61.50 | 58.00 | 59.50 | 0.70 | 0.81 | 74.32 | 0.49 | 0.835 |
Model 3A | FT | MACCS | 74.50 | 61.50 | 60.00 | 60.00 | 0.77 | 0.83 | 79.73 | 0.60 | 0.775 |
Model 4A | RF | MACCS | 76.00 | 62.50 | 61.00 | 63.50 | 0.71 | 0.84 | 75.68 | 0.53 | 0.841 |
Model 1B | SVM | CORINA | 81.00 | 75.50 | 75.50 | 75.00 | 0.79 | 0.77 | 78.38 | 0.57 | 0.783 |
Model 2B | DT | CORINA | 78.00 | 63.00 | 65.00 | 69.00 | 0.84 | 0.70 | 75.68 | 0.53 | 0.784 |
Model 3B | FT | CORINA | 83.50 | 68.50 | 70.50 | 70.00 | 0.82 | 0.80 | 81.08 | 0.62 | 0.831 |
Model 4B | RF | CORINA | 80.50 | 65.00 | 63.50 | 65.50 | 0.79 | 0.77 | 78.38 | 0.57 | 0.837 |
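
To compare the eight models in the table above programmatically, the results can be loaded into a small data structure and ranked by any metric. The sketch below transcribes a subset of the columns from the table. The field names and the helpers `best_by` and `cv_gap` are hypothetical and are introduced only for illustration.

```python
# Columns transcribed from the table above:
# (model, algorithm, descriptors, training acc %, 10-fold CV acc %,
#  test acc %, test MCC, test AUC)
ROWS = [
    ("Model 1A", "SVM", "MACCS", 74.00, 65.00, 71.62, 0.45, 0.706),
    ("Model 2A", "DT", "MACCS", 76.00, 58.00, 74.32, 0.49, 0.835),
    ("Model 3A", "FT", "MACCS", 74.50, 60.00, 79.73, 0.60, 0.775),
    ("Model 4A", "RF", "MACCS", 76.00, 61.00, 75.68, 0.53, 0.841),
    ("Model 1B", "SVM", "CORINA", 81.00, 75.50, 78.38, 0.57, 0.783),
    ("Model 2B", "DT", "CORINA", 78.00, 65.00, 75.68, 0.53, 0.784),
    ("Model 3B", "FT", "CORINA", 83.50, 70.50, 81.08, 0.62, 0.831),
    ("Model 4B", "RF", "CORINA", 80.50, 63.50, 78.38, 0.57, 0.837),
]
FIELDS = ("name", "algorithm", "descriptors", "train_acc",
          "cv10_acc", "test_acc", "mcc", "auc")
models = [dict(zip(FIELDS, row)) for row in ROWS]


def best_by(key):
    """Return the model with the highest value for the given metric."""
    return max(models, key=lambda m: m[key])


def cv_gap(model):
    """Training accuracy minus 10-fold cross-validation accuracy (in points)."""
    return model["train_acc"] - model["cv10_acc"]


for metric in ("test_acc", "mcc", "auc"):
    top = best_by(metric)
    print(f"best by {metric}: {top['name']} ({top[metric]})")

smallest = min(models, key=cv_gap)
print(f"smallest train/CV gap: {smallest['name']} ({cv_gap(smallest):.1f} points)")
```

Ranking by test accuracy or MCC selects Model 3B, in line with the abstract, whereas ranking by AUC selects Model 4A. The choice of "best" model therefore depends on which metric is prioritized.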